Chapter 4. Clustering and classification

The theme for this week is clustering and classification. This is something totally new for me, so let's see what I learn this week.

About the data

In this exercise we examine the Boston dataset from the R MASS package. The dataset describes housing values in the suburbs of Boston; the data frame has 506 observations of 14 variables.

library(MASS)

#loading the MASS package, which contains the Boston dataset
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

So we have 506 suburbs of Boston. Here are the explanations of the different variables:

crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s

#calling for the different packages I might use in this exercise
library(ggplot2)
library(GGally)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------ tidyverse 1.2.0 --
## ✓ tibble  1.3.4     ✓ purrr   0.2.4
## ✓ tidyr   0.7.2     ✓ dplyr   0.7.4
## ✓ readr   1.1.1     ✓ stringr 1.2.0
## ✓ tibble  1.3.4     ✓ forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
library(corrplot)
## corrplot 0.84 loaded
#checking the summary of the Boston dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here is the summary of the data. chas is a binary (dummy) variable. For most of the other variables the mean and the median are reasonably close to each other, so their distributions look roughly symmetric; crim and zn are clear exceptions, as they are strongly skewed.

Looking for the correlations

Let's also make a graph.

#graphical overview
pairs(Boston)

With the pairs plot it's not very easy to see any correlations, so let's study the correlations between the variables with a correlation plot.

#making a correlation matrix and drawing a correlation plot to visualise the correlations for easier interpretation

cormatrix <- cor(Boston)
cormatrix %>% round(digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cormatrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

There are strong negative correlations (big red balls) between dis-nox, dis-age and dis-indus: moving further from the Boston employment centres, the nitrogen oxide concentration goes down, as does the proportion of owner-occupied units built prior to 1940 and the proportion of non-retail business acres. This makes sense.

Also lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation: when the percentage of lower-status population grows, the median home value gets smaller. This is also understandable.

Rad and tax have a strong positive correlation, meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises. Why not?
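The strongest pairs can also be pulled out of the correlation matrix programmatically. This is an optional sketch (not part of the original analysis): it lists every variable pair with an absolute correlation above 0.7.

```r
library(MASS)

cormatrix <- cor(Boston)

# indices of the upper triangle where |r| > 0.7 (excludes the diagonal and duplicates)
high <- which(abs(cormatrix) > 0.7 & upper.tri(cormatrix), arr.ind = TRUE)

# pair up the row/column names with the correlation values
data.frame(var1 = rownames(cormatrix)[high[, 1]],
           var2 = colnames(cormatrix)[high[, 2]],
           r    = round(cormatrix[high], 2))
```

This confirms the pairs discussed above (for example rad-tax at 0.91) without reading them off the plot by eye.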

Getting ready for the analysis

Let's move further with the analysis.

I standardise the dataset so that every variable has mean 0 and standard deviation 1; this puts the variables on a comparable scale, which the following methods need. I print the summary of the scaled data set.

#Standardising the dataset
boston_scaled <- scale(Boston)

#printing out summaries of the scaled data
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#boston_scaled is a matrix, so I convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)

Now we have scaled the data. As seen in the summary, all the means are exactly 0 and the variables share a common scale, so they can be compared with each other and fitted together in a model. (Note that scaling does not make a skewed variable normally distributed; it only standardises its location and spread.)
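As a quick sanity check of what scale() actually does (a sketch, not part of the exercise itself): each column is transformed as (x - mean(x)) / sd(x), which is easy to verify by hand for a single variable.

```r
library(MASS)

# standardise the crim column by hand ...
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# ... and compare with what scale() produces for the same column
auto <- as.data.frame(scale(Boston))$crim

all.equal(manual, auto)  # TRUE
```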

Next I will change the continuous crime rate variable in my data set into a categorical variable. I cut the crim variable by its quantiles so that the low, middle and high crime rates form their own categories. I then drop the old crim variable from the dataset and replace it with the new crime variable.

#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#and create a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))


table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

Now I need to divide the dataset into train and test sets, so that 80% of the data belongs to the train set.

#Here I make the train and the test sets: 80% of the observations go to the train set and the rest to the test set.
#Note that sample() is random, so the split (and everything downstream) changes between runs unless a seed is set with set.seed().
dim(boston_scaled)
## [1] 506  14
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)

#ind is a plain vector, so dim() returns NULL; length(ind) would give the number of sampled rows (404)
dim(ind)
## NULL
#create the train set

train <- boston_scaled[ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.348 -1.033 1.231 -0.437 -0.376 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.317 -0.386 0.434 -0.144 -0.299 ...
##  $ rm     : num  0.3634 0.0432 -0.6129 -0.4763 0.7065 ...
##  $ age    : num  -0.3153 0.1714 0.8251 0.4769 0.0968 ...
##  $ dis    : num  1.1739 -0.2268 -0.6521 0.0926 -0.4459 ...
##  $ rad    : num  -0.982 -0.522 -0.522 -0.637 -0.522 ...
##  $ tax    : num  0.0817 -0.6659 -0.0311 -0.6007 -0.1438 ...
##  $ ptratio: num  -1.18 -0.857 -1.735 1.175 1.129 ...
##  $ black  : num  0.365 0.426 0.421 -1.359 0.426 ...
##  $ lstat  : num  -0.561 -0.891 -0.142 2.109 -0.698 ...
##  $ medv   : num  -0.6559 0.2248 0.0182 -1.0148 0.4314 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 3 3 2 3 3 2 1 1 ...
#create the test set
test <- boston_scaled[-ind,]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  0.0487 -0.4872 -0.4872 -0.4872 -0.4872 ...
##  $ indus  : num  -0.476 -0.437 -0.437 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.265 -0.144 -0.144 -0.144 -0.144 ...
##  $ rm     : num  -0.93 -1.179 -1.017 -0.203 -0.338 ...
##  $ age    : num  1.116 -1.136 1.049 0.822 0.719 ...
##  $ dis    : num  1.086122 0.000692 0.001357 0.086364 0.312653 ...
##  $ rad    : num  -0.522 -0.637 -0.637 -0.637 -0.637 ...
##  $ tax    : num  -0.577 -0.601 -0.601 -0.601 -0.601 ...
##  $ ptratio: num  -1.5 1.18 1.18 1.18 1.18 ...
##  $ black  : num  0.328 -0.741 0.218 0.441 -0.551 ...
##  $ lstat  : num  2.419 -0.135 1.172 0.85 0.648 ...
##  $ medv   : num  -0.656 -0.254 -0.971 -0.797 -0.841 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 3 3 3 3 3 3 2 2 2 ...

Linear discriminant analysis (LDA)

Now I fit a linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables as predictors. I draw the LDA (bi)plot of the model.

# fit the linear discriminant analysis on the train set. crime as the target variable and all the other variables as predictor variables


lda.fit <- lda(formula= crime ~ ., data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# drawing a plot of the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Then I predict the classes with the LDA model on my test data and cross tabulate the results with the crime categories from the test set.

#saving the crime categories from the test set 
correct_classes <- test$crime

library(dplyr)


#removing the categorical crime variable from the test dataset

test <- dplyr::select(test, -crime)

#Predicting the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

#cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       5        3    0
##   med_low    1      19       10    0
##   med_high   0       6       15    0
##   high       0       0        0   27

Here I can see how well my model predicts. It predicts the high crime rates perfectly, but it makes some errors in the other classes. The same phenomenon was visible in the train set plot.
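The cross tabulation can also be summarised into a single accuracy number: the correct predictions sit on the diagonal. A sketch using the counts printed above (the split is random, so these exact counts belong to this particular run):

```r
# the confusion matrix printed above, entered by hand
tab <- matrix(c(16,  5,  3,  0,
                 1, 19, 10,  0,
                 0,  6, 15,  0,
                 0,  0,  0, 27),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# share of correct predictions = diagonal / total
sum(diag(tab)) / sum(tab)  # 77/102, about 0.75
```

So roughly three out of four suburbs get classified into the right crime category.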

Distance measures and clustering

Next I move on to clustering, starting with distance measures. I use the Euclidean distance, which is probably the most common one. K-means is an old and widely used clustering method. kmeans() computes the distances internally, but I have to choose the number of clusters myself. I first tried the model with 4 clusters, but to me 3 clusters seemed to work better.


#Reloading the Boston dataset and standardising it (distance measures are sensitive to the scale of the variables)
dim(Boston)
## [1] 506  14
scale_Boston2 <- scale(Boston)

scale_Boston2 <- as.data.frame(scale_Boston2)


#Calculating the distances with the Euclidean distance

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
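To see what dist() computes, the Euclidean distance between the first two (scaled) observations can be worked out by hand: the square root of the sum of squared coordinate differences. A small sketch:

```r
library(MASS)

scaled <- as.data.frame(scale(Boston))

x <- as.numeric(scaled[1, ])
y <- as.numeric(scaled[2, ])

# Euclidean distance by hand: sqrt of the sum of squared differences
by_hand <- sqrt(sum((x - y)^2))

# the same value from dist()
with_dist <- as.numeric(dist(scaled[1:2, ]))

all.equal(by_hand, with_dist)  # TRUE
```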
# k-means clustering
km <-kmeans(scale_Boston2, centers = 3)


# ploting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next I investigate the optimal number of clusters. There are many ways to find it, but here I use the total within-cluster sum of squares (WCSS) and visualise the result with a plot.

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares for each number of clusters
# (kmeans uses random starting centers, so the curve can vary slightly between runs)
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is where the total WCSS drops radically, and above I can see that this happens around k = 2. So the optimal number of clusters would be 2. Next I run the algorithm again with two clusters.
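The "radical drop" can also be located programmatically: the elbow sits roughly where the decrease in total WCSS between consecutive values of k is largest. A sketch (the seed value is arbitrary and only there for reproducibility; kmeans starts are random, so without a seed the curve varies a little between runs):

```r
library(MASS)

set.seed(123)  # arbitrary seed, only for reproducibility
scaled <- scale(Boston)

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(scaled, k)$tot.withinss)

# the step with the largest drop in total WCSS suggests the elbow
which.max(-diff(twcss)) + 1
```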

# k-means clustering
km <-kmeans(scale_Boston2, centers = 2)

# plot the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)

With the big pairs plot it is difficult to interpret the results, so I also made two smaller plots to study some of the relationships more closely. In the first plot (where all the variables are included) I can see that chas doesn't follow any pattern with any of the other variables, and many other pairs don't show a clear pattern either. However, I think I can see negative relationships between indus-dis, nox-dis and dis-lstat, and positive relationships between indus-nox, age-nox and age-lstat.